BERT End to End (Fine-tuning + Predicting) in 5 minutes with Cloud TPU¶
Overview¶
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.
This Colab demonstrates using a free Colab Cloud TPU to fine-tune sentence and sentence-pair classification tasks built on top of pretrained BERT models and to run predictions on the tuned model. The Colab demonstrates loading pretrained BERT models from both TF Hub and checkpoints.
Note: You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket for this Colab to run.
Please follow the Google Cloud TPU quickstart to create a GCP account and a GCS bucket. New accounts get $300 in free credit to get started with any GCP product. You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.
This notebook is hosted on GitHub. To view it in its original repository, after opening the notebook, select File > View on GitHub.
Learning objectives¶
In this notebook, you will learn how to train and evaluate a BERT model using a TPU.
Set up your TPU environment¶
Train on TPU
1. Create a Cloud Storage bucket for your TensorBoard logs at http://console.cloud.google.com/storage and fill in the BUCKET parameter in the "Parameters" section below.
2. On the main menu, click Runtime and select Change runtime type. Set "TPU" as the hardware accelerator.
3. Click Runtime again and select Runtime > Run All (watch out: the "Colab-only auth for this notebook and the TPU" cell requires user input). You can also run the cells manually with Shift-ENTER.
Set up your TPU environment¶
In this section, you perform the following tasks:
- Set up a Colab TPU running environment
- Verify that you are connected to a TPU device
- Upload your credentials to the TPU so it can access your GCS bucket
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf
import pandas as pd
from IPython.display import clear_output
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)
from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.
Import BERT modules¶
With your environment configured, you can now prepare and import the BERT modules. The following step clones the source code from GitHub and imports the modules from the source. Alternatively, you can install BERT via pip (!pip install bert-tensorflow).
tf.__version__
!pip install bert-tensorflow
# import bert
# import tensorflow_hub as hub
# from bert import modeling
# from bert import tokenization
# from bert import optimization
# from bert import run_classifier
# from bert import run_classifier_with_tfhub
import sys
!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path += ['bert_repo']
# import python modules defined by BERT
import modeling
import optimization
import run_classifier
import run_classifier_with_tfhub
import tokenization
# import tfhub
import tensorflow_hub as hub
Prepare dataset¶
This next section of code performs the following tasks:
- Specify the task and download the training data.
- Specify the pretrained BERT model.
- Specify the GCS bucket and create the output directory for model checkpoints and eval results.
TASK = "PTTMovieReviews" #@param {type:"string"}
assert TASK in ["WSDMFakeNews", "PTTMovieReviews"]
RAW_DATA_DIR = "raw_data"
!mkdir {RAW_DATA_DIR}
if TASK == "PTTMovieReviews":
  TASK_DATA_DIR = os.path.join(RAW_DATA_DIR, "ptt_movie_review")
  !mkdir {TASK_DATA_DIR}
  !wget https://s3-ap-northeast-1.amazonaws.com/smartnews-dmp-tmp/meng/ptt_movie_review_tokenized.zip
  !unzip ptt_movie_review_tokenized.zip -d {TASK_DATA_DIR}
  !rm -rf {TASK_DATA_DIR}/__MACOSX
  !mv {TASK_DATA_DIR}/PPT_Movie_Review_train-1.txt {TASK_DATA_DIR}/train.txt
  !mv {TASK_DATA_DIR}/PPT_Movie_Review_test-1.txt {TASK_DATA_DIR}/test.txt
  train = pd.read_csv(f"{TASK_DATA_DIR}/train.txt", header=None, sep="\t")
  train.columns = ["label", "text"]
elif TASK == "WSDMFakeNews":
  TASK_DATA_DIR = os.path.join(RAW_DATA_DIR, "wsdm_fakenews")
  !mkdir {TASK_DATA_DIR}
  zip_file = "drive-download-20190516T113709Z-001.zip"
  file_url = "https://s3-ap-northeast-1.amazonaws.com/smartnews-dmp-tmp/meng/fake_news/" + zip_file
  !wget {file_url}
  !unzip {zip_file}
  !mv dev_bert.tsv dev.tsv
  !mv test_bert.tsv test.tsv
  !mv train_bert.tsv train.tsv
  !mv dev.tsv test.tsv train.tsv {TASK_DATA_DIR}
print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))
!ls $TASK_DATA_DIR
BERT TF Hub modules:
BUCKET = 'tpu-training-result' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'
OUTPUT_DIR = 'gs://{}/bert-tfhub/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))
# Available pretrained model checkpoints:
#   uncased_L-12_H-768_A-12: uncased BERT base model
#   uncased_L-24_H-1024_A-16: uncased BERT large model
#   cased_L-12_H-768_A-12: cased BERT base model
#   chinese_L-12_H-768_A-12: Chinese BERT base model
BERT_MODEL = 'chinese_L-12_H-768_A-12' #@param {type:"string"}
Define tokenizer¶
Now let's load the tokenizer module from TF Hub and try it out.
BERT_MODEL_HUB = 'https://tfhub.dev/google/bert_' + BERT_MODEL + '/1'
# def create_tokenizer_from_hub_module():
# """Get the vocab file and casing info from the Hub module."""
# with tf.Graph().as_default():
# bert_module = hub.Module(BERT_MODEL_HUB)
# tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
# with tf.Session() as sess:
# vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
# tokenization_info["do_lower_case"]])
# return bert.tokenization.FullTokenizer(
# vocab_file=vocab_file, do_lower_case=do_lower_case)
tokenizer = run_classifier_with_tfhub.create_tokenizer_from_hub_module(BERT_MODEL_HUB)
tokenizer.tokenize("這是使用 BERT tokenizer 的一個例句")
Define task processors¶
from run_classifier import DataProcessor, InputExample, InputFeatures
import pandas as pd
class WSDMFakeNewsProcessor(DataProcessor):
  """Processor for the WSDM - Fake News Classification Kaggle competition.

  https://www.kaggle.com/c/fake-news-pair-classification-challenge
  """

  def __init__(self):
    self.language = "zh"

  def get_train_examples(self, data_dir):
    df = pd.read_csv(os.path.join(data_dir, "train.tsv"), sep="\t")
    examples = []
    for (i, line) in enumerate(df.itertuples()):
      guid = "train-%d" % i
      text_a = tokenization.convert_to_unicode(line[1])
      text_b = tokenization.convert_to_unicode(line[2])
      label = tokenization.convert_to_unicode(line[3])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples

  def get_dev_examples(self, data_dir):
    df = pd.read_csv(os.path.join(data_dir, "dev.tsv"), sep="\t")
    examples = []
    for (i, line) in enumerate(df.itertuples()):
      guid = "dev-%d" % i
      text_a = tokenization.convert_to_unicode(line[1])
      text_b = tokenization.convert_to_unicode(line[2])
      label = tokenization.convert_to_unicode(line[3])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples

  def get_test_examples(self, data_dir):
    df = pd.read_csv(os.path.join(data_dir, "test.tsv"), sep="\t").fillna('')
    examples = []
    for (i, line) in enumerate(df.itertuples()):
      guid = "test-%d" % i
      text_a = tokenization.convert_to_unicode(line[1])
      text_b = tokenization.convert_to_unicode(line[2])
      # Test labels are unknown; use a dummy label from the label set.
      label = "unrelated"
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples

  def get_labels(self):
    """See base class."""
    return ["unrelated", "agreed", "disagreed"]
class PTTMovieReviewsProcessor(DataProcessor):
  """Processor for PTT Movie Reviews."""

  def __init__(self):
    self.language = "zh"

  def get_train_examples(self, data_dir):
    df = pd.read_csv(os.path.join(data_dir, "train.txt"), header=None, sep="\t")
    examples = []
    for (i, line) in enumerate(df.itertuples()):
      guid = "train-%d" % i
      text_a = tokenization.convert_to_unicode(line[2])
      label = tokenization.convert_to_unicode(line[1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

  def get_dev_examples(self, data_dir):
    df = pd.read_csv(os.path.join(data_dir, "test.txt"), header=None, sep="\t")
    examples = []
    for (i, line) in enumerate(df.itertuples()):
      guid = "dev-%d" % i
      text_a = tokenization.convert_to_unicode(line[2])
      label = tokenization.convert_to_unicode(line[1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

  def get_labels(self):
    """See base class."""
    return ["N", "P"]
if TASK == "PTTMovieReviews":
  processor = PTTMovieReviewsProcessor()
elif TASK == "WSDMFakeNews":
  processor = WSDMFakeNewsProcessor()
label_list = processor.get_labels()
print("processor:", processor)
print("label_list:", label_list)
Define hyperparameters¶
TRAIN_BATCH_SIZE = 256
EVAL_BATCH_SIZE = 8
PREDICT_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 200.0
MAX_SEQ_LENGTH = 256
# Warmup is a period at the start of training where the learning rate
# is small and gradually increases, which usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
SAVE_SUMMARY_STEPS = 500
# Compute number of train and warmup steps from batch size
train_examples = processor.get_train_examples(TASK_DATA_DIR)
num_train_steps = int(len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
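To make the warmup comment above concrete, here is a minimal, self-contained sketch of a linear-warmup-then-linear-decay schedule. This is a simplified approximation of what BERT's optimization.py does, not the library code itself:

```python
def lr_at_step(step, base_lr=2e-5, num_train_steps=1000, warmup_proportion=0.1):
    """Simplified BERT-style schedule: linear warmup, then linear decay to 0."""
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    if step < num_warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup period.
        return base_lr * step / num_warmup_steps
    # Afterwards, decay linearly from base_lr down to 0 at num_train_steps.
    return base_lr * (1 - step / num_train_steps)

print(lr_at_step(50))    # halfway through warmup: half the base rate
print(lr_at_step(1000))  # end of training: 0.0
```

With WARMUP_PROPORTION = 0.1, the first 10% of training steps ramp the rate up, which helps avoid large, destabilizing updates while the classifier head is still randomly initialized.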
# Setup TPU related config
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
NUM_TPU_CORES = 8
ITERATIONS_PER_LOOP = 1000
def get_run_config(output_dir):
  return tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      model_dir=output_dir,
      save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=ITERATIONS_PER_LOOP,
          num_shards=NUM_TPU_CORES,
          per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))
Fine-tune and Run Predictions on a pretrained BERT Model from TF Hub¶
This section demonstrates fine-tuning from a pre-trained BERT TF Hub module and running predictions.
OUTPUT_DIR
# Force TF Hub to write to the GCS bucket we provide.
os.environ['TFHUB_CACHE_DIR'] = OUTPUT_DIR
model_fn = run_classifier_with_tfhub.model_fn_builder(
    num_labels=len(label_list),
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    bert_hub_module_handle=BERT_MODEL_HUB
)
estimator_from_tfhub = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=get_run_config(OUTPUT_DIR),
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=PREDICT_BATCH_SIZE,
)
At this point, you can now fine-tune the model, evaluate it, and run predictions on it.
# Train the model
# Train the model
def model_train(estimator):
  # Truncate/pad sequences to at most MAX_SEQ_LENGTH tokens.
  train_features = run_classifier.convert_examples_to_features(
      train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  print('***** Started training at {} *****'.format(datetime.datetime.now()))
  print('  Num examples = {}'.format(len(train_examples)))
  print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
  tf.logging.info("  Num steps = %d", num_train_steps)
  train_input_fn = run_classifier.input_fn_builder(
      features=train_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=True,
      drop_remainder=True)
  estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  print('***** Finished training at {} *****'.format(datetime.datetime.now()))
FakeNews dataset size: 288,512 training examples. Rough per-epoch timings:
- TPU: 1 epoch -> 1,024 steps (wall time: 18 min 8 s)
- GPU: 1 epoch -> ~1.2 s/step * 9,016 steps (batch size 32) -> ~180 min
- CPU: 1 epoch -> ~45 s/step * 9,016 steps (batch size 32; 9,016 * 32 = dataset size)
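The step counts in these timing notes follow directly from the dataset size and batch size; a quick sanity check:

```python
dataset_size = 288512  # FakeNews training examples, as noted above

# Steps per epoch = examples / batch size (full batches, remainder dropped).
steps_per_epoch_gpu = dataset_size // 32  # batch size 32 on GPU/CPU
print(steps_per_epoch_gpu)  # 9016, matching the note above

# Estimated GPU wall time per epoch at ~1.2 s/step, in minutes:
print(1.2 * steps_per_epoch_gpu / 60)  # ~180 minutes
```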
# %%time
# model_train(estimator_from_tfhub)
def model_eval(estimator):
  # Evaluate the model.
  eval_examples = processor.get_dev_examples(TASK_DATA_DIR)
  eval_features = run_classifier.convert_examples_to_features(
      eval_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  print('***** Started evaluation at {} *****'.format(datetime.datetime.now()))
  print('  Num examples = {}'.format(len(eval_examples)))
  print('  Batch size = {}'.format(EVAL_BATCH_SIZE))
  # Eval will be slightly WRONG on the TPU because it will truncate
  # the last batch.
  eval_steps = int(len(eval_examples) / EVAL_BATCH_SIZE)
  eval_input_fn = run_classifier.input_fn_builder(
      features=eval_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=False,
      drop_remainder=True)
  result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
  print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))
  output_eval_file = os.path.join(OUTPUT_DIR, "eval_results.txt")
  with tf.gfile.GFile(output_eval_file, "w") as writer:
    print("***** Eval results *****")
    for key in sorted(result.keys()):
      print('  {} = {}'.format(key, str(result[key])))
      writer.write("%s = %s\n" % (key, str(result[key])))
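Because drop_remainder=True discards the final partial batch, up to EVAL_BATCH_SIZE - 1 dev examples are silently skipped during evaluation. A minimal sketch of the arithmetic, using a hypothetical example count:

```python
def evaluated_counts(num_examples, batch_size):
    """Return (examples actually evaluated, examples silently skipped)."""
    eval_steps = num_examples // batch_size  # only full batches are run
    evaluated = eval_steps * batch_size
    return evaluated, num_examples - evaluated

# With EVAL_BATCH_SIZE = 8 and, say, 1,003 dev examples (hypothetical count):
print(evaluated_counts(1003, 8))  # (1000, 3): the last 3 examples are dropped
```

This is why eval metrics on TPU can differ slightly from a CPU/GPU run that evaluates every example.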
# model_eval(estimator_from_tfhub)
PREDICT_BATCH_SIZE
def model_predict(estimator):
  # Make predictions on a subset of eval examples.
  prediction_examples = processor.get_dev_examples(TASK_DATA_DIR)[:PREDICT_BATCH_SIZE]
  input_features = run_classifier.convert_examples_to_features(
      prediction_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(
      features=input_features, seq_length=MAX_SEQ_LENGTH,
      is_training=False, drop_remainder=True)
  predictions = estimator.predict(predict_input_fn)
  for example, prediction in zip(prediction_examples, predictions):
    print('text_a: %s\ntext_b: %s\nlabel: %s\nprediction: %s\n'
          % (example.text_a, example.text_b, str(example.label),
             prediction['probabilities']))
# model_predict(estimator_from_tfhub)
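prediction['probabilities'] holds one softmax score per class, in the same order as label_list. A minimal sketch of mapping it back to a label, using a made-up probability vector:

```python
label_list_example = ["N", "P"]  # the PTTMovieReviews labels
probabilities = [0.12, 0.88]    # hypothetical softmax output for one example

# The predicted class is the index of the highest probability.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_label = label_list_example[best_index]
print(predicted_label)  # "P"
```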
Visualization¶
from google.colab import auth
auth.authenticate_user()
# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'general-186304'
!gcloud config set project {project_id}
# Download the file from a given Google Cloud Storage bucket.
!mkdir models
!mkdir models/{TASK}
!gsutil cp -r gs://{BUCKET}/bert-tfhub/models/{TASK} models/
# # Print the result to make sure the transfer worked.
# !cat /tmp/gsutil_download.txt
!pip install pytorch-pretrained-bert
from google.colab import files
uploaded = files.upload()
!cp bert_config.json models/{TASK}
!cp models/{TASK}/model.ckpt-1768.data-00000-of-00001 models/{TASK}/model.ckpt.data-00000-of-00001
!cp models/{TASK}/model.ckpt-1768.index models/{TASK}/model.ckpt.index
!cp models/{TASK}/model.ckpt-1768.meta models/{TASK}/model.ckpt.meta
BERT_BASE_DIR = f"models/{TASK}"
!pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
{BERT_BASE_DIR}/model.ckpt \
{BERT_BASE_DIR}/bert_config.json \
{BERT_BASE_DIR}/pytorch_model.bin
clear_output()
%load_ext autoreload
%autoreload 2
import sys
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
# !rm -r bertviz_repo  # Uncomment if you need a clean pull of the repo
!test -d bertviz_repo || git clone https://github.com/leemengtaiwan/bertviz bertviz_repo
if 'bertviz_repo' not in sys.path:
  sys.path += ['bertviz_repo']
from bertviz import attention, visualization
from bertviz.pytorch_pretrained_bert import BertModel, BertTokenizer
!ls /usr/local/share/jupyter/nbextensions/google.colab/
!find / -name require.js
from google.colab import files
files.download('/usr/local/lib/python3.6/dist-packages/notebook/static/components/requirejs/require.js')
%%javascript
require.config({
paths: {
d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min'
}
});
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
      <script src="/static/components/requirejs/require.js"></script>
      <script>
        requirejs.config({
          paths: {
            base: '/static/base',
            "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
            jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
          },
        });
      </script>
      '''))
# TASK = "PTTMovieReviews"
from IPython.display import clear_output
bert_version = f'models/{TASK}'
model = BertModel.from_pretrained(bert_version, from_tf=True)
clear_output()
from google.colab import files
uploaded = files.upload()
!cp vocab.txt models/{TASK}
tokenizer = BertTokenizer.from_pretrained(bert_version)
sentence_a = "老爷爷自用的咽炎偏方,早上喝这个,3天见效,治一个好一个!"
sentence_b = "咽炎最佳治疗方法 这些小偏方治疗咽炎超管用"
attention_visualizer = visualization.AttentionVisualizer(model, tokenizer)
tokens_a, tokens_b, attn = attention_visualizer.get_viz_data(sentence_a, sentence_b)
call_html()
attention.show(tokens_a, tokens_b, attn)
